
Poster: hwhankins Date: Jul 13, 2012 3:53pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

I've run into this problem a lot. As an amateur radio operator, I look things up on the web, and there's a lot of good info on small amateur radio club sites and individuals' sites whose domains have expired. Sometimes I can get the info through the archive, but lately, more and more often, I'm running into the problem of the new owner of the domain (usually a domain parking service) using a robots.txt to delete the archived copy of the prior site.

This archive.org policy of allowing a new owner of a domain to destroy the archive of the previous owner's site is the same as allowing them to burn down a public library. The new owner of a domain has no legal right to control the previous owner's content.

Retroactively deleting the archives should require a notarized statement, under penalty of perjury, that the requester owns the content they're asking to have deleted.


Poster: Jeremy Leader Date: Jul 13, 2012 11:48pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

But whose responsibility is this problem? The (new) domain owners aren't trying to erase the archives; they're (presumably) trying to keep their current site from being crawled, which, as I understand it, is the purpose of the robots.txt standard. I've looked at all the documentation linked from robotstxt.org, and none of it mentions deleting archives based on robots.txt entries.

I can imagine that archive.org needs policies and procedures for owners of content to request its removal; perhaps they could extend robots.txt to add a "Delete:" directive (with syntax otherwise identical to "Disallow:")? So "Disallow:" would mean "don't crawl", and "Delete:" would mean "don't crawl, and forget you ever saw it".
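A robots.txt using that hypothetical extension might look like the following. To be clear, the "Delete:" directive here is only the proposal sketched above; it is not part of the robots.txt standard, and no crawler is known to honor it:

```text
# Hypothetical only -- "Delete:" is the extension proposed above,
# not a real robots.txt directive.
User-agent: ia_archiver
Disallow: /private/    # don't crawl, but keep any existing archives
Delete: /old-site/     # don't crawl, and forget you ever saw it
```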

For most search engines, there wouldn't be much difference (because they frequently re-crawl, and only serve search results from the most recent crawl), but for archive.org or other similar archives, the difference would be huge.

Now, it's entirely possible that many of these domain owners really do want the history of their domains deleted; then the problem is distinguishing between the current owner of a domain deleting history of content they provided, and the current owner of a domain trying to delete history of content a previous owner provided.

Note that the change of ownership of a domain doesn't automatically include change of ownership of copyright in the content previously served from the domain. But to fully respect that principle, archive.org would have to track domain ownership (which can be tricky, between whois privacy, and things like shell companies and subsidiaries blurring the concept of "same owner").


Poster: CoJaBo Date: Jul 25, 2012 6:56am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

Just thought I'd bump/clarify this: Archive.org does have a special "delete" directive, as pointed out in the first post. A site that wishes to erase its content has to copy the specific lines from the removal FAQ page to erase already-archived content; it doesn't happen just by disallowing all robots. The issue here is that DomainSponsor is unilaterally adding this delete directive to hundreds of thousands of domains it buys; since DS is one of the largest expired-domain resellers, this probably accounts for a significant percentage of ALL domains that have expired or ever will.


Poster: jory2 Date: Jul 25, 2012 10:01am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

@CoJaBo - where is this special "delete" directive you're referring to? From what I've read, this website's policies describe no such "delete" command.

Words like Block, Exclude, and Disallow appear throughout this company's official policies, but I have never come across the word *DELETE*; have you?
In fact, what the policy appears to say is that a simple block from free public view is enacted.

This post was modified by jory2 on 2012-07-25 17:01:52


Poster: Jeremy Leader Date: Jul 25, 2012 9:49am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

What "delete directive"?

Are you referring to:
User-agent: ia_archiver
Disallow: /

My understanding of the robots.txt standard is that that says "ia_archiver, don't crawl any part of this site." The robots.txt standard is completely silent on the subject of archives (because the standard was designed for search engine crawlers, which only publish the results of their most recent crawl).
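To illustrate that point, here is a small sketch (my addition, not something from the thread) using Python's standard-library robots.txt parser. The only question the standard can answer is "may this user-agent fetch this URL?"; it has no notion of archives at all:

```python
# Minimal sketch: the robots.txt standard only answers
# "may this user-agent fetch this URL?" -- nothing about archives.
from urllib import robotparser

# The directive quoted above.
rules = """\
User-agent: ia_archiver
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# ia_archiver may not crawl anything; other agents are unaffected.
print(rp.can_fetch("ia_archiver", "http://example.com/page.html"))   # False
print(rp.can_fetch("SomeOtherBot", "http://example.com/page.html"))  # True
```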

How should I specify "don't crawl my site (so new content doesn't get added to the archives), but don't delete any old archives"? My understanding is that DS is using the robots.txt protocol to indicate that they don't want their domains' current content crawled or archived; I doubt they care about archives of their domains' previous history (from before they bought them).

Is there a different User-agent string that controls just the crawl, and not the persistence of old archives?

Archive.org is choosing to allow the current owner of a domain to control the archives of the past history of that domain, even though the current owner likely doesn't have any legal ownership of the previous content.


Poster: jory2 Date: Jul 25, 2012 10:43am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

@Jeremy Leader - I wasn't referring to any Disallow; I asked CoJaBo where he/she found the delete command.
Oddly enough, I have never come across one for this website.

RE: robots.txt: as I'm sure you know, respecting the robots.txt file is voluntary; no informed site owner would rely solely on such a file if their goal was to protect their intellectual property from being copied without permission or consent.

For what it's worth, the domain name _ _ _.com doesn't own the copyrights to the copyrighted materials that were used on that domain name; the rightful owner of the works remains the owner.
Should the rightful owner choose not to renew or reactivate his/her website, that certainly doesn't mean the copyrights expired along with the domain name.
When the domain name is "parked", the copyrights and all intellectual property rights remain the property of the rightful owner. The rights are not automatically transferred to the new owner of the domain name, or to companies like Domainsponsor.com.
You wrote: "I doubt they care about archives of their domains' previous history (before they bought them)."
What exactly do you think they bought?
Wouldn't common sense tell you, and anyone else, that all that was bought was the URL _ _ _.com, and not all the rights to the works that were previously used on it?

This post was modified by jory2 on 2012-07-25 17:43:56


Poster: Jeremy Leader Date: Jul 25, 2012 1:50pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

Jory2, I believe I'm in complete agreement with you. My reference to a "delete directive" was in response to CoJaBo.

Can anyone confirm CoJaBo's claim that a directive like:

User-agent: *
Disallow: /

will NOT cause archive deletion (but will cause Internet Archive to stop crawling the site)?


Poster: CoJaBo Date: Aug 7, 2012 4:03pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+^W 24 million+ domains

Disallowing the user-agent "*" will not cause removal; it will stop crawling as expected, but only the specific user-agent given on the removal FAQ page will actually cause removal of past content.
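As described in this thread (I haven't verified this against Internet Archive's own documentation), the two variants behave very differently:

```text
# Stops all compliant crawlers, including Internet Archive's;
# per this thread, does NOT remove already-archived pages.
User-agent: *
Disallow: /

# Per this thread, the specific agent string from the removal FAQ
# page additionally triggers removal of past archived content.
User-agent: ia_archiver
Disallow: /
```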

I do agree that a removal request should have to state "Remove" explicitly in the directive, to prevent exactly this kind of mistake. Others off-site have pointed out that some "bad bot" blacklists also include these lines without any explanation of what "ia_archiver" actually is; it's possible DomainSponsor got the lines from a source like that and didn't realize they would cause *removal* of content from the Archive, since at that stage they would have been far separated from the removal FAQ entry.

If anyone's been following the list of sites registered to their nameserver (that is, sites being removed from the Archive in this way), it has increased nearly two-hundred-fold since I made this post; the current count is over 24 *million*.
I'm not sure whether this indicates they are expanding that rapidly, or simply that this particular index site is catching up with their existing registrations; I suspect the latter is more likely.


Poster: Jeremy Leader Date: Aug 7, 2012 4:37pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+^W 24 million+ domains

OK, CoJaBo, thanks for that clarification.

So there's no way to say "Internet Archive, don't crawl my site, but don't delete the archive", while still allowing other crawlers to crawl the site?


Poster: CoJaBo Date: Aug 7, 2012 4:40pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+^W 24 million+ domains

It doesn't seem so; the FAQ only mentions those lines for removal, and it doesn't appear to offer an option for "don't crawl the site anymore, but keep the existing content".

I had hoped someone from either DomainSponsor or the Archive would have responded to my emails by now.
